Skip to content

Sync NEON optimized kernels to current upstream PR (#19119)#7

Merged
jgibson2 merged 1 commit intopolycamfrom
jgibson/sync-neon-to-upstream
Apr 29, 2026
Merged

Sync NEON optimized kernels to current upstream PR (#19119)#7
jgibson2 merged 1 commit intopolycamfrom
jgibson/sync-neon-to-upstream

Conversation

@jgibson2
Copy link
Copy Markdown
Collaborator

Summary

Replaces the in-tree NEON grid_sampler_2d / sum implementation on polycam with the version that's been through three rounds of review on pytorch/executorch#19119. No new functionality — purely catching us up to the version that's about to land upstream.

What changed vs the version on polycam

  1. Hardware fp16 vs software fp16 split. The original polycam version compiled the entire op_grid_sampler_2d.cpp with -march=armv8.2-a+fp16, which raises SIGILL on ARMv8 chips that lack the +fp16 extension. The new version moves the HW-fp16 path into a dedicated TU (op_grid_sampler_2d_fp16_hw.cpp + .h) compiled as a separate buck cxx_library / CMake OBJECT library so the -march flag stays scoped to that one TU. The dispatcher in op_grid_sampler_2d.cpp picks between the HW path and a software-convert NEON path at runtime via cpuinfo_has_arm_neon_fp16(). Plain ARMv8 chips run the SW path with no illegal instructions; ARMv8.2+ chips get the faster HW path.

  2. Forward declaration of grid_sampler_2d_bilinear_fp16_hw extracted into op_grid_sampler_2d_fp16_hw.h so the definition site has a visible prototype (fixes -Wmissing-prototypes).

  3. dtypes_match gate at the top of the fast-path eligibility check (input.scalar_type() == grid.scalar_type() == out.scalar_type()). The reinterpret_cast<__fp16*> in the HW kernel and the data_ptr<T>() calls in the other paths each assume a single dtype across all three tensors; mismatched buffers would silently corrupt output. Falls back to portable when types disagree.

CMake side switches from set_source_files_properties on a single source to a dedicated OBJECT library (grid_sampler_2d_fp16_hw_impl) linked into optimized_kernels via $<BUILD_LOCAL_INTERFACE:>. Mirrors the buck cxx_library of the same name.

What is NOT touched

  • kernels/optimized/verify.cpp — polycam-specific on-device verification binary, stays as-is.
  • tools/cmake/common/preset.cmake — polycam's DESCRIPTION quoting fix, stays as-is.

Test plan

  • CMake build clean: cmake --build cmake-out --target optimized_kernels (macOS arm64, EXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON).
  • op_grid_sampler_2d_fp16_hw.cpp.o archived into liboptimized_kernels.a (verified via ar -t).
  • Re-run the on-device verify binary (verify_optimized_kernels) on a Pixel 9 to confirm fp16 HW path still produces matching outputs.
  • Re-run the depth model end-to-end to confirm latency unchanged.

🤖 Generated with Claude Code

Replace the in-tree NEON grid_sampler_2d / sum implementation on polycam
with the version that's gone through review on pytorch#19119.
The diff is essentially a fast-forward of three rounds of upstream review
feedback that we hadn't carried back here:

  1. Hardware fp16 vs software fp16 split. The original polycam version
     compiled the entire op_grid_sampler_2d.cpp with -march=armv8.2-a+fp16,
     which raises SIGILL on ARMv8 chips that lack the +fp16 extension. The
     new version moves the hardware-fp16 path into a dedicated TU
     (op_grid_sampler_2d_fp16_hw.cpp + .h) compiled as a separate buck
     cxx_library / CMake OBJECT library so the -march flag stays scoped
     to that single TU. The dispatcher in op_grid_sampler_2d.cpp picks
     between the HW path and a software-convert NEON path at runtime via
     cpuinfo_has_arm_neon_fp16(). Plain ARMv8 chips run the SW path with
     no illegal instructions; ARMv8.2+ chips get the faster HW path.

  2. Forward declaration of grid_sampler_2d_bilinear_fp16_hw extracted
     into op_grid_sampler_2d_fp16_hw.h so the definition site has a
     visible prototype and -Wmissing-prototypes builds stay green.

  3. dtypes_match gate at the top of the fast-path eligibility check
     (input.scalar_type() == grid.scalar_type() == out.scalar_type()).
     The reinterpret_cast<__fp16*> calls in the HW kernel and the
     data_ptr<T>() calls in the other paths each assume a single dtype
     across all three tensors; mismatched buffers would silently corrupt
     output. Falls back to portable when types disagree.

CMake side switches from set_source_files_properties on a single source
to a dedicated OBJECT library (grid_sampler_2d_fp16_hw_impl) linked into
optimized_kernels via $<BUILD_LOCAL_INTERFACE:>; mirrors the buck
cxx_library of the same name. Verify binary (kernels/optimized/verify.cpp)
and the preset.cmake DESCRIPTION quoting fix are unchanged — those are
polycam-specific tooling, not part of the NEON sync.

Build verified on macOS arm64 with EXECUTORCH_BUILD_KERNELS_OPTIMIZED=ON;
op_grid_sampler_2d_fp16_hw.cpp.o is archived into liboptimized_kernels.a
as expected.
@jgibson2 jgibson2 merged commit 929a1c9 into polycam Apr 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant